Refactor XPath evaluation and optimize memory allocation #1
Merged
Memory leak fixes:
- StreamingParser: buffer, events, and complete_elements vecs never
shrank after drain operations, causing unbounded growth in long-lived
parsers. Added shrink_to() calls matching StreamingSaxParser behavior.
- StructuralIndex: shrink_to_fit() existed but was never called after
building. Initial capacity estimates over-allocate by 2-3x; now
reclaimed immediately after build_children_from_parents().
- DocumentAccumulator: reduced default pre-allocation from 64KB to 4KB.
The old 64KB multiplied quickly across concurrent accumulators.
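The drain-then-shrink pattern described above can be sketched as follows. This is an illustrative toy, not the crate's actual API: the names `StreamingParser`, `drain_events`, and `SHRINK_THRESHOLD` are hypothetical, and the 4x threshold follows the reallocation-churn guard mentioned in a later commit.

```rust
// Only shrink when capacity exceeds 4x the live length, so that a
// drain/refill cycle does not reallocate on every call.
const SHRINK_THRESHOLD: usize = 4;

struct StreamingParser {
    events: Vec<String>,
}

impl StreamingParser {
    fn drain_events(&mut self, n: usize) -> Vec<String> {
        let take = n.min(self.events.len());
        let drained: Vec<String> = self.events.drain(..take).collect();
        // Reclaim excess capacity left behind by the drain, but only past
        // the threshold, leaving 2x headroom for the next batch.
        if self.events.capacity() > self.events.len().max(1) * SHRINK_THRESHOLD {
            self.events.shrink_to(self.events.len() * 2);
        }
        drained
    }
}

fn main() {
    let mut p = StreamingParser {
        events: (0..1000).map(|i| i.to_string()).collect(),
    };
    let got = p.drain_events(990);
    assert_eq!(got.len(), 990);
    assert_eq!(p.events.len(), 10);
    // Capacity was reclaimed instead of staying pinned at ~1000.
    assert!(p.events.capacity() < 1000);
}
```

Without the threshold, `shrink_to` after every drain would trade the leak for constant reallocation; with it, capacity is only released once it drifts well past what the parser actually holds.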
Contention fix:
- XPath cache: replaced deep CompiledExpr cloning with Arc<CompiledExpr>.
Every cache hit previously cloned all Vec<Op>, Strings, and
Box<CompiledExpr> recursively. Now it's a cheap Arc pointer bump.
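The cache change amounts to the following pattern, shown here as a minimal sketch with hypothetical names (`XPathCache`, `get_or_compile`); the real `CompiledExpr` holds ops vectors, strings, and boxed subexpressions, all of which a deep clone would copy.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Illustrative stand-in for the crate's compiled-expression type.
struct CompiledExpr {
    ops: Vec<String>,
}

struct XPathCache {
    entries: HashMap<String, Arc<CompiledExpr>>,
}

impl XPathCache {
    fn get_or_compile(&mut self, expr: &str) -> Arc<CompiledExpr> {
        if let Some(cached) = self.entries.get(expr) {
            // Cache hit: a reference-count bump, not a recursive deep clone.
            return Arc::clone(cached);
        }
        let compiled = Arc::new(CompiledExpr { ops: vec![expr.to_string()] });
        self.entries.insert(expr.to_string(), Arc::clone(&compiled));
        compiled
    }
}

fn main() {
    let mut cache = XPathCache { entries: HashMap::new() };
    let a = cache.get_or_compile("/root/item");
    let b = cache.get_or_compile("/root/item");
    // Both handles point at the same compilation; nothing was re-cloned.
    assert!(Arc::ptr_eq(&a, &b));
    assert_eq!(a.ops.len(), 1);
}
```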
Correctness + memory fix:
- compare_values in XPath eval: was using format!("{}", node_id) to
compare nodes, which compared raw u32 IDs as strings — both wrong
per XPath 1.0 spec and wasteful (O(n*m) String allocations). Now
uses actual document text content via node_string_value().
- Consolidated duplicated get_node_text_content/collect_text_content
from lib.rs into shared dom::node_string_value().
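For reference, the XPath 1.0 string-value of an element is the concatenation of all descendant text nodes in document order. The sketch below shows that semantic on a toy recursive `Node` type; the crate's real DOM works over indexed node IDs rather than an owned tree.

```rust
// Toy node type for illustration only.
enum Node {
    Element { children: Vec<Node> },
    Text(String),
}

// String-value per XPath 1.0: concatenation of descendant text,
// in document order.
fn node_string_value(node: &Node) -> String {
    match node {
        Node::Text(s) => s.clone(),
        Node::Element { children } => children.iter().map(node_string_value).collect(),
    }
}

fn main() {
    let doc = Node::Element {
        children: vec![
            Node::Text("Hello, ".into()),
            Node::Element { children: vec![Node::Text("world".into())] },
            Node::Text("!".into()),
        ],
    };
    assert_eq!(node_string_value(&doc), "Hello, world!");
}
```

Comparing these string-values is what the spec requires; comparing formatted node IDs, as before, produced answers unrelated to document content.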
Safety fix:
- empty_binary: replaced .unwrap() on OwnedBinary::new(0) with match
to avoid potential NIF panic.
Best practices:
- Added #[must_use] to evaluate(), evaluate_from_node(), compile(),
validate_strict(), and XPathValue type.
https://claude.ai/code/session_015igpdCrNYKuoPrHWZ5RXYc
…e all clippy warnings
- Fix XPathValue::to_string_value() to return empty string for NodeSets
instead of misleading format!("[node:{}]", id) — callers with document
access now use dom::node_string_value() per XPath 1.0 spec
- Add resolve_string() helper to XPath functions for proper NodeSet-to-string
conversion; pass document access to all 8 string functions
- Replace duplicated get_string_value/collect_text with dom::node_string_value
- Remove residual get_node_text_content wrapper from NIF layer
- Optimize NodeSet-vs-NodeSet comparison from O(n*m) to O(n+m) by
pre-computing right-side string values
- Fix streaming parser shrink_to() reallocation churn with 4x threshold
- Remove redundant #[must_use] on compile() (Result already has it)
- Remove dead inherent methods on XmlDocument duplicated by DocumentAccess trait
- Remove unused XmlAttribute re-export from dom/mod.rs
- Fix len_zero clippy warnings in unified_scanner tests
- Add #[expect(clippy::ptr_arg)] to intern_cow (needs Cow for zero-copy optimization)
- Add 4 correctness tests for NodeSet equality and string function semantics
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
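The NodeSet-vs-NodeSet optimization listed above can be sketched like this. Per XPath 1.0, two node-sets compare equal if any pair of nodes has equal string-values; pre-hashing the right side replaces the naive nested loop with one pass over each set. The inputs here are already-resolved string values; in the crate, a document-backed lookup would produce them first.

```rust
use std::collections::HashSet;

// O(n + m) node-set equality: hash the right side's string-values once,
// then probe with each left-side value, instead of comparing all pairs.
fn node_sets_equal(left: &[String], right: &[String]) -> bool {
    let right_values: HashSet<&str> = right.iter().map(String::as_str).collect();
    left.iter().any(|v| right_values.contains(v.as_str()))
}

fn main() {
    let left = vec!["10".to_string(), "42.5".to_string()];
    let right = vec!["7".to_string(), "42.5".to_string()];
    // "42.5" appears on both sides, so the sets compare equal.
    assert!(node_sets_equal(&left, &right));
    assert!(!node_sets_equal(&left, &["9".to_string()]));
}
```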
- Replace expect() in XPath parser peek() with match + defensive Eof fallback — eliminates BEAM VM crash risk if invariant is violated
- Replace expect() in XPath lexer read_string() with unwrap_or defensive fallback — same rationale
- Thread document access into compare_numbers() so relational operators (< <= > >=) properly resolve NodeSet text content before numeric conversion — fixes silent wrong results for expressions like /r/price > 10 where <price>42.5</price>
- Add resolve_number() helper mirroring resolve_string() pattern
- Add test for relational operators on NodeSets
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
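The relational-operator fix can be sketched as below. Per XPath 1.0, a node-set operand in a relational comparison is true if any node's string-value, converted to a number, satisfies the relation; `to_number` and `node_set_gt` are hypothetical names, and the string-value lookup is stubbed as plain strings.

```rust
// XPath number() conversion: trim, parse, NaN on anything non-numeric.
fn to_number(s: &str) -> f64 {
    s.trim().parse::<f64>().unwrap_or(f64::NAN)
}

// NodeSet > number: true if ANY node's numeric value satisfies it.
// `node_values` stands in for resolved node text content.
fn node_set_gt(node_values: &[&str], rhs: f64) -> bool {
    node_values.iter().any(|v| to_number(v) > rhs)
}

fn main() {
    // Mirrors /r/price > 10 where <price>42.5</price>.
    assert!(node_set_gt(&["42.5"], 10.0));
    // Non-numeric text converts to NaN; NaN comparisons are false.
    assert!(!node_set_gt(&["abc"], 10.0));
}
```

Before the fix, the raw node ID rather than the text "42.5" reached the numeric conversion, so expressions like this silently returned wrong answers.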
Summary
This PR refactors XPath evaluation logic, consolidates duplicate code, and optimizes memory allocation patterns throughout the codebase. The changes improve correctness of XPath comparisons, reduce unnecessary cloning, and prevent unbounded memory growth in long-lived parsers and accumulators.
Key Changes
XPath Evaluation Improvements
- Consolidated get_node_text_content and collect_text_content logic into a shared dom::node_string_value function to eliminate duplication and ensure consistent XPath string-value semantics across the codebase
- Fixed compare_values to use actual node string-values instead of formatting raw node IDs as numbers, which was both semantically incorrect and wasteful
- #[must_use] attributes: Applied to evaluate, evaluate_from_node, compile, and XPathValue to catch unused results at compile time

Memory Optimization
- XPath cache now stores Arc<CompiledExpr> instead of cloning entire compiled expressions on cache hits, reducing allocations from deep clones to cheap pointer bumps
- Added shrink_to() calls after draining buffers to prevent unbounded capacity growth in long-lived parsers
- Shrank the events and complete_elements vectors after partial drains to release excess capacity
- Added a shrink_to_fit() call after building document indices to reclaim over-allocated capacity from initial size estimates

Error Handling
- Rewrote empty_binary to handle potential allocation failures gracefully instead of panicking in a BEAM NIF

Documentation
https://claude.ai/code/session_015igpdCrNYKuoPrHWZ5RXYc